Biting Fly Analysis

Group I

Aditya Negi, Harsh Gupta, Yash Gupta

2023-04-11

Why Biting Fly? Theoretical Background

Details

In particular:

Original Study

Our Study

Our dataset is a truncated version of the original: of its 13 variables, we have only 7. Our goals are twofold.

  1. We want to evaluate the validity of the original statistical analysis. In particular, do the assumptions in the original hold? Do we obtain the same results as they do with a smaller subset of the variables?

  2. We would like to use techniques which were not used in the original analysis to obtain more insight.

Exploratory Data Analysis

The data consist of 70 rows and 8 columns. The columns are as follows:

  1. Group: 0 for L. torrens, 1 for L. carteri.

  2. Wing length, shortened to “WL”

  3. Wing width, shortened to “WW”

  4. Third palp length, shortened to “TPL”. (Palps are organs between the antennae responsible for smell, touch, and taste.)

  5. Third palp width, shortened to “TPW”.

  6. Fourth palp length, shortened to “FPL”.

  7. Length of twelfth antennal segment, shortened to “ASTW”.

  8. Length of thirteenth antennal segment, shortened to “ASTH”.

Except for Group, which is categorical, all of these variables are numeric.

Glance at the Data

##   Group WL WW TPL TPW FPL ASTW ASTH
## 1     0 85 41  31  13  25    9    8
## 2     0 87 38  32  14  22   13   13
## 3     0 94 44  36  15  27    8    9
## 4     0 92 43  32  17  28    9    9
## 5     0 96 43  35  14  26   10   10
## 6     0 91 44  36  12  24    9    9
## 'data.frame':    70 obs. of  8 variables:
##  $ Group: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WL   : int  85 87 94 92 96 91 90 92 91 87 ...
##  $ WW   : int  41 38 44 43 43 44 42 43 41 38 ...
##  $ TPL  : int  31 32 36 32 35 36 36 36 36 35 ...
##  $ TPW  : int  13 14 15 17 14 12 16 17 14 11 ...
##  $ FPL  : int  25 22 27 28 26 24 26 26 23 24 ...
##  $ ASTW : int  9 13 8 9 10 9 9 9 9 9 ...
##  $ ASTH : int  8 13 9 9 10 9 9 9 9 10 ...

Box Plots

As a first step, we compare the distribution of each variable across the two groups.
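The box-plot code is not shown in the excerpt; a minimal base-R sketch (using synthetic stand-in data, since the real `fly` data frame is not reproduced here; `plot = FALSE` returns the box statistics so they can be inspected):

```r
set.seed(1)
# synthetic stand-in: 35 observations per group, one measurement variable
fly_demo <- data.frame(Group = factor(rep(c(0, 1), each = 35)),
                       WL    = c(rnorm(35, 91, 4), rnorm(35, 96, 5)))
# one box per group; with plot = FALSE the summary statistics are returned
bp <- boxplot(WL ~ Group, data = fly_demo, plot = FALSE)
bp$n  # number of observations behind each box
```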

Density Plot

Comments

##    Group  WL WW TPL TPW FPL ASTW ASTH
## 34     0  90 40  37  12  22    9   10
## 35     0 104 46  37  14  30   10   10
## 36     1  86 19  37  11  25    9    9
## 37     1  94 40  38  14  31    6    7
## 38     1 103 48  39  14  33   10   10

Correlation plot
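The correlation matrix underlying such a plot can be computed with base R alone (sketch on synthetic stand-in data; that the original drew it with a dedicated package such as corrplot is an assumption):

```r
set.seed(2)
# synthetic stand-in with the 7 numeric columns used in the report
X <- as.data.frame(matrix(rnorm(70 * 7), nrow = 70))
names(X) <- c("WL", "WW", "TPL", "TPW", "FPL", "ASTW", "ASTH")
cm <- cor(X)   # 7 x 7 Pearson correlation matrix
round(cm, 2)   # the values a correlation plot visualises
```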

Comments

Data visualisation using PCA

With seven variables, it is difficult to judge directly whether the two groups are “separated”. Plotting the first few principal components may reveal clear differences and suggest more detailed investigation.
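The PCA call producing the importance-of-components summary below is not shown; a minimal `prcomp` sketch on synthetic stand-in data (the reported standard deviations suggest the covariance form, i.e. centred but unscaled, which is an assumption):

```r
set.seed(3)
# synthetic stand-in for the 7 numeric measurement columns
X <- matrix(rnorm(70 * 7), nrow = 70,
            dimnames = list(NULL, c("WL", "WW", "TPL", "TPW", "FPL", "ASTW", "ASTH")))
pca <- prcomp(X, center = TRUE)  # centred, unscaled PCA (an assumption)
summary(pca)                     # std devs + variance proportions per PC
scores <- pca$x[, 1:2]           # PC1/PC2 scores, used for the score plot
```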

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     7.4317 3.5747 3.0654 2.56232 1.50634 1.28018 0.44094
## Proportion of Variance 0.6271 0.1451 0.1067 0.07455 0.02576 0.01861 0.00221
## Cumulative Proportion  0.6271 0.7722 0.8789 0.95342 0.97918 0.99779 1.00000

Scree Plot

Biplot

Score Plot

Clearly, there is some evidence that the groups are “separate”, but plenty of observations from group 1 are mixed in with group 0. Does plotting the third principal component help?

3D Plot

It doesn’t seem that adding a third principal component to the visualisation helps much.

Factor Analysis

According to the PCA, two linear combinations of the variables explain most of the variation in the data. This suggests that factor analysis might be helpful.

We quickly verify that the data may indeed be usefully factor-analysed using the KMO test and Bartlett’s test for sphericity.

KMO(fly[, -1])
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = fly[, -1])
## Overall MSA =  0.69
## MSA for each item = 
##   WL   WW  TPL  TPW  FPL ASTW ASTH 
## 0.76 0.77 0.74 0.81 0.76 0.58 0.56
cortest.bartlett(fly[, -1])
## R was not square, finding R from data
## $chisq
## [1] 171.5042
## 
## $p.value
## [1] 1.318656e-25
## 
## $df
## [1] 21

Principal Components Factor Method

First, we perform principal component factor analysis. We can do this even though we have not yet verified the assumption of multivariate normality. Because two principal components explain most of the variation in the data, we suspect that two factors may be enough.

## Principal Components Analysis
## Call: principal(r = fly[, -1], nfactors = 2, rotate = "none", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##       PC1   PC2   h2    u2 com
## WL   0.84 -0.10 0.72 0.285 1.0
## WW   0.72 -0.22 0.57 0.425 1.2
## TPL  0.54 -0.37 0.43 0.570 1.8
## TPW  0.55 -0.26 0.37 0.634 1.4
## FPL  0.64 -0.45 0.60 0.396 1.8
## ASTW 0.60  0.72 0.88 0.117 1.9
## ASTH 0.58  0.75 0.90 0.099 1.9
## 
##                        PC1  PC2
## SS loadings           2.93 1.54
## Proportion Var        0.42 0.22
## Cumulative Var        0.42 0.64
## Proportion Explained  0.65 0.35
## Cumulative Proportion 0.65 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  28.72  with prob <  0.00035 
## 
## Fit based upon off diagonal values = 0.93

Biplot

Rotation

Next, we attempt a varimax rotation.

## Principal Components Analysis
## Call: principal(r = fly[, -1], nfactors = 2, rotate = "varimax", scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
##       RC1   RC2   h2    u2 com
## WL   0.76  0.37 0.72 0.285 1.5
## WW   0.73  0.21 0.57 0.425 1.2
## TPL  0.66 -0.02 0.43 0.570 1.0
## TPW  0.60  0.08 0.37 0.634 1.0
## FPL  0.78 -0.03 0.60 0.396 1.0
## ASTW 0.11  0.93 0.88 0.117 1.0
## ASTH 0.08  0.95 0.90 0.099 1.0
## 
##                        RC1  RC2
## SS loadings           2.52 1.96
## Proportion Var        0.36 0.28
## Cumulative Var        0.36 0.64
## Proportion Explained  0.56 0.44
## Cumulative Proportion 0.56 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  28.72  with prob <  0.00035 
## 
## Fit based upon off diagonal values = 0.93

Biplot

The varimax rotation is easier to interpret. The first factor is heavily loaded on the non-antennal segment data, while the second factor is highly loaded on antennal length data. Hence we can roughly call the first factor the “non-antennal factor” and the second the “antennal factor”.

Summary

We can summarise our factor loadings as follows:

Factor analyses were attempted for 3 and 4 factors, but they yielded unsatisfactory results.

Assumptions for MANOVA

  1. The observations from each population should be independently drawn.

  2. Each population is multivariate normal.

  3. The populations should have a common covariance matrix \(\Sigma\).

We may take the first assumption to hold, since each observation comes from a distinct specimen. It remains to verify 2) and 3).

Assumption of Multivariate Normality

Before testing multivariate normality, we first test the univariate normality of each variable within each group.

Wing Length

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 2] and Group 
## 
##   Level Statistic     p.value   Normality
## 1     0 0.9510978 0.122472063  Not reject
## 2     1 0.8840922 0.001517748      Reject
## --------------------------------------------------

Wing Width

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 3] and Group 
## 
##   Level Statistic      p.value   Normality
## 1     0 0.9444748 7.666707e-02  Not reject
## 2     1 0.6613229 9.837494e-08      Reject
## --------------------------------------------------

Third Palp Length

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 4] and Group 
## 
##   Level Statistic    p.value   Normality
## 1     0 0.9210176 0.01525362      Reject
## 2     1 0.9663625 0.35136571  Not reject
## --------------------------------------------------

Third Palp Width

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 5] and Group 
## 
##   Level Statistic    p.value   Normality
## 1     0 0.9328069 0.03392550      Reject
## 2     1 0.9343470 0.03773533      Reject
## --------------------------------------------------

Fourth Palp Length

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 6] and Group 
## 
##   Level Statistic     p.value   Normality
## 1     0 0.9641591 0.303597278  Not reject
## 2     1 0.8987116 0.003651696      Reject
## --------------------------------------------------

Antennal Segment 12

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 7] and Group 
## 
##   Level Statistic      p.value   Normality
## 1     0  0.796832 1.769595e-05      Reject
## 2     1  0.918061 1.254047e-02      Reject
## --------------------------------------------------

Antennal Segment 13

## 
##   Shapiro-Wilk Normality Test (alpha = 0.05) 
## -------------------------------------------------- 
##   data : fly[, 8] and Group 
## 
##   Level Statistic      p.value   Normality
## 1     0 0.7963882 1.734819e-05      Reject
## 2     1 0.9095155 7.197832e-03      Reject
## --------------------------------------------------

Comments

Most of the variables fail the univariate normality test in at least one group. We therefore transform the data to bring it closer to normality.

Box-Cox Transformations

First, we remove observation 36, which was a prominent outlier with respect to wing width.

flyr <- fly[-36, ]
transformed.flyr <- flyr

Now, we will apply the Box-Cox transformation to each variable.

Wing Length

boxcox(lm(data = flyr, WL~Group),lambda = seq(-2.5,7.5,1/10))

# lambda = 2  (MLE)
transformed.flyr$WL <- bcPower(flyr$WL,2)
nor.test(WL~Group,data=transformed.flyr, plot=NULL,alpha=0.01)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : WL and Group 
## 
##   Level Statistic     p.value   Normality
## 1     0 0.9439853 0.074064433  Not reject
## 2     1 0.8935651 0.003114805      Reject
## --------------------------------------------------

Wing Width

transformed.flyr$WW = bcPower(flyr$WW, 3)
nor.test(WW~Group,data=transformed.flyr,plot=NULL, alpha=0.01)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : WW and Group 
## 
##   Level Statistic   p.value   Normality
## 1     0 0.9526625 0.1367946  Not reject
## 2     1 0.9650367 0.3387956  Not reject
## --------------------------------------------------

Third Palp Length

transformed.flyr$TPL = bcPower(flyr$TPL,0.5)
nor.test(TPL~Group,data=transformed.flyr,plot=NULL, alpha=0.01)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : TPL and Group 
## 
##   Level Statistic    p.value   Normality
## 1     0 0.9152249 0.01041137  Not reject
## 2     1 0.9675149 0.39644333  Not reject
## --------------------------------------------------

Third Palp Width

transformed.flyr$TPW = bcPower(flyr$TPW,0)
nor.test(TPW~Group,data=transformed.flyr,plot=NULL, alpha=0.01)
## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : TPW and Group 
## 
##   Level Statistic    p.value   Normality
## 1     0 0.9454765 0.08228744  Not reject
## 2     1 0.9241424 0.02119192  Not reject
## --------------------------------------------------

Comments

Fourth Palp Length

boxcox(lm(flyr$FPL~flyr$Group),lambda = seq(-2,2,1/10))

# lambda =1

Antennal Segment 12

boxcox(lm(flyr$ASTW~flyr$Group),lambda = seq(-2.5,2.5,1/10))

# lambda =1

Antennal Segment 13

#as13
boxcox(lm(flyr$ASTH~flyr$Group),lambda = seq(-2.5,2.5,1/10))

# lambda =1

Comments

Jittering

## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : ASTW and Group 
## 
##   Level Statistic     p.value   Normality
## 1     0 0.9046967 0.005301109      Reject
## 2     1 0.9568060 0.195926659  Not reject
## --------------------------------------------------
## 
##   Shapiro-Wilk Normality Test (alpha = 0.01) 
## -------------------------------------------------- 
##   data : ASTH and Group 
## 
##   Level Statistic    p.value   Normality
## 1     0 0.9465347 0.08867869  Not reject
## 2     1 0.9300098 0.03130258  Not reject
## --------------------------------------------------

Comments

Now that we have added jitter, the ASTH variable passes the univariate normality test in both groups, and ASTW in one of the two (at \(\alpha = 0.01\)). We may interpret the jitter as representing “rounding error” in the measurements.
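The jitter step itself is not shown in the excerpt; a minimal base-R sketch using `jitter()` on synthetic integer data (the ±0.5 amount, i.e. one measurement unit of rounding, is an assumption):

```r
set.seed(4)
# synthetic stand-in for the integer-valued antennal measurements
astw <- sample(6:13, 69, replace = TRUE)
# uniform noise of +/- 0.5 breaks the integer ties
astw_jit <- jitter(astw, amount = 0.5)
range(astw_jit - astw)  # the added noise stays within half a unit
```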

Detecting Other Outliers

We try to detect outliers in our transformed data matrix.

Comments

Multivariate Normality

We test this assumption using graphical methods (gamma plots) and hypothesis testing (Henze-Zirkler test of multivariate normality).
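A gamma (chi-square) plot compares the ordered squared Mahalanobis distances with the corresponding chi-square quantiles; a sketch of the construction on synthetic single-group data:

```r
set.seed(5)
X  <- matrix(rnorm(35 * 7), nrow = 35)  # synthetic stand-in for one group
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
n  <- nrow(X); p <- ncol(X)
q  <- qchisq((seq_len(n) - 0.5) / n, df = p)  # theoretical chi-square(p) quantiles
# under multivariate normality the points should lie near the 45-degree line
plot(q, sort(d2), xlab = "chi-square quantiles",
     ylab = "ordered squared distances")
abline(0, 1)
```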

Gamma Plot For L. torrens

## [1]  2 15

Gamma Plot For L. carteri

## 51 46 
## 15 10

Comments

Henze-Zirkler test for Multivariate Normality

##             Henze-Zirkler test for Multivariate Normality 
## 
##   data : tA[, -c(1)] 
## 
##   HZ              : 0.9847339 
##   p-value         : 0.02539919 
## 
##   Result  : Data are not multivariate normal (sig.level = 0.05)
##             Henze-Zirkler test for Multivariate Normality 
## 
##   data : tB[, -c(1)] 
## 
##   HZ              : 0.8977875 
##   p-value         : 0.5043688 
## 
##   Result  : Data are multivariate normal (sig.level = 0.05)

Comments

At the 1% level of significance, the Henze-Zirkler test does not reject multivariate normality for either group (p ≈ 0.025 and p ≈ 0.504). We therefore assume multivariate normality of the transformed data for both groups.

Homogeneity of Covariance Matrices: Box’s M Test

Having addressed multivariate normality, we now test the homogeneity of the covariance matrices of the two groups.

\(H_0 : \Sigma_1=\Sigma_2\) vs \(H_1 : \Sigma_1 \ne \Sigma_2\)

rstatix::box_m(transformed.flyr[, 2:8], transformed.flyr$Group)
## # A tibble: 1 × 4
##   statistic p.value parameter method                                            
##       <dbl>   <dbl>     <dbl> <chr>                                             
## 1      45.6  0.0191        28 Box's M-test for Homogeneity of Covariance Matric…

Comment

The p-value of Box’s M test (0.019) exceeds \(\alpha = 0.01\), so we do not reject \(H_0\) at the 1% level; the assumption of a common covariance matrix is tenable.

MANOVA

With the assumptions of MANOVA satisfied, we can proceed with testing the equality of the two mean vectors.

\(H_0 : \mathbf{\mu_1} = \mathbf{\mu_2}\) vs \(H_1 : \mathbf{\mu_1} \ne \mathbf{\mu_2}\)

summary(manova(cbind(WL, WW, TPL, TPW, FPL, ASTW, ASTH)~Group, transformed.flyr))
##           Df  Pillai approx F num Df den Df    Pr(>F)    
## Group      1 0.49368   8.4967      7     61 3.189e-07 ***
## Residuals 67                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comment

ANOVA for individual variables

Note that for comparing two populations, ANOVA is equivalent to a two-sample t-test (equal variances).
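This equivalence can be checked numerically: with two groups, the one-way ANOVA F statistic equals the square of the pooled-variance t statistic (sketch on synthetic data):

```r
set.seed(6)
d <- data.frame(g = factor(rep(c(0, 1), each = 35)),
                y = c(rnorm(35, mean = 10), rnorm(35, mean = 11)))
tval <- t.test(y ~ g, data = d, var.equal = TRUE)$statistic  # pooled-variance t
Fval <- anova(lm(y ~ g, data = d))[1, "F value"]             # one-way ANOVA F
all.equal(unname(tval)^2, Fval)  # TRUE: F = t^2 for two groups
```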

Wing Length

##             Df   Sum Sq Mean Sq F value Pr(>F)  
## Group        1  1708484 1708484   5.312 0.0243 *
## Residuals   67 21548145  321614                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wing Width

##             Df    Sum Sq   Mean Sq F value Pr(>F)  
## Group        1 1.532e+08 153221530   5.777  0.019 *
## Residuals   67 1.777e+09  26521301                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Third Palp Length

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        1  7.376   7.376   42.66 1.03e-08 ***
## Residuals   67 11.584   0.173                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Third Palp Width

##             Df Sum Sq  Mean Sq F value Pr(>F)
## Group        1 0.0066 0.006629   0.503  0.481
## Residuals   67 0.8830 0.013179

Fourth Palp Length

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        1  352.1   352.1   25.91 3.11e-06 ***
## Residuals   67  910.4    13.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Antennal Segment 12

##             Df Sum Sq Mean Sq F value Pr(>F)
## Group        1   0.26  0.2647   0.196  0.659
## Residuals   67  90.35  1.3485

Antennal Segment 13

##             Df Sum Sq Mean Sq F value Pr(>F)
## Group        1   0.39  0.3939   0.308  0.581
## Residuals   67  85.61  1.2778

Comments

The above results exactly match those of the original 1975 analysis performed by William Atchley!

Profile Analysis on Transformed Data

Assumptions

  1. All variables must be in similar units.
  2. Observations of different groups are independent.
  3. Variables of each group are distributed as multivariate normal with same covariance matrices.

As all of the above assumptions are satisfied for the given data, we can proceed to test whether the profiles are parallel.

Hypothesis Testing

\(H_0:\) Profiles are parallel, i.e. \(\mu_{1i}-\mu_{1,i-1} = \mu_{2i}-\mu_{2,i-1},\; i=2,\dots,p.\)

\(H_1:\) Profiles are not parallel.

Testing of Hypothesis

\(\textbf{Decision:}\) Reject \(H_0\) (parallel profiles) at level \(\alpha\) if \[T^2 = (\bar{x}_1-\bar{x}_2)'C' \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) CS_{pooled}C' \right]^{-1} C (\bar{x}_1-\bar{x}_2) > c^2, \] where \[c^2 = \frac{(n_1 + n_2 -2)(p-1)}{n_1+n_2-p} F_{p-1,\,n_1+n_2-p} (\alpha). \]

## Calculation of p-value
n1 <- 35; n2 <- 34; p <- 7
# Contrast matrix of successive differences (6 x 7)
C <- matrix(c(-1,  1,  0,  0,  0,  0,  0,
               0, -1,  1,  0,  0,  0,  0,
               0,  0, -1,  1,  0,  0,  0,
               0,  0,  0, -1,  1,  0,  0,
               0,  0,  0,  0, -1,  1,  0,
               0,  0,  0,  0,  0, -1,  1),
            nrow = 6, ncol = 7, byrow = TRUE)
S_pooled <- ((n1 - 1) * S.tA + (n2 - 1) * S.tB) / (n1 + n2 - 2)
# Test statistic
T_square <- t(C %*% (txbar - tybar)) %*%
  solve((1/n1 + 1/n2) * C %*% S_pooled %*% t(C)) %*% C %*% (txbar - tybar)

# Scaling constant: under H0, T_square / scale_const ~ F(p-1, n1+n2-p)
scale_const <- (n1 + n2 - 2) * (p - 1) / (n1 + n2 - p)
p_value <- pf(T_square / scale_const, p - 1, n1 + n2 - p, lower.tail = FALSE)
print(p_value)
##              [,1]
## [1,] 9.933596e-08

Comments

The null hypothesis is rejected at \(\alpha = 0.01\); i.e. the profiles are not parallel.

Linear Discriminant Analysis

Assumptions

We have assumed that

We have already shown that

Since all assumptions are satisfied, we’ll apply LDA to the transformed data in two different ways:

  1. By splitting data into training and testing sets

  2. Using leave-one-out cross-validation.

Train-Test

model.lda <- lda(Group~., data = train.data)
model.lda
## Call:
## lda(Group ~ ., data = train.data)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##         WL       WW       TPL      TPW      FPL     ASTW     ASTH
## 0 4747.839 27242.30  9.953778 2.663342 25.96429 9.455446 9.432200
## 1 5041.339 29573.69 10.602066 2.693169 30.07143 9.828678 9.827236
## 
## Coefficients of linear discriminants:
##                LD1
## WL   -3.954937e-04
## WW   -7.093095e-06
## TPL   2.405805e+00
## TPW  -2.598409e+00
## FPL   1.476042e-01
## ASTW  1.820921e-01
## ASTH -1.733004e-01

LDA Score Plot (1-dimensional)

Confusion matrix

Confusion matrix on training data

Comment: On the training data set, the APER (apparent error rate, i.e. misclassification rate) of LDA is about 14%.

Confusion matrix based on testing data

Comment: On the testing data set, the misclassification rate of LDA is about 23%.

Cross-Validation

Confusion Matrix

## [1] 0.2318841
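Leave-one-out cross-validation is built into MASS::lda via `CV = TRUE`; a sketch on synthetic two-group data (the variable names are stand-ins, not the real measurements):

```r
library(MASS)  # lda() with CV = TRUE performs leave-one-out internally
set.seed(7)
d <- data.frame(Group = factor(rep(c(0, 1), each = 35)),
                x1 = c(rnorm(35, 0), rnorm(35, 2)),
                x2 = c(rnorm(35, 0), rnorm(35, 2)))
cv  <- lda(Group ~ ., data = d, CV = TRUE)
aer <- mean(cv$class != d$Group)  # estimated actual error rate
aer
```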

Comments:

Quadratic Discriminant Analysis

## [1] 0.2463768
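The QDA fit itself is not shown; MASS::qda has the same interface as lda, including leave-one-out cross-validation (sketch on synthetic data):

```r
library(MASS)  # qda() fits a separate covariance matrix per group
set.seed(8)
d <- data.frame(Group = factor(rep(c(0, 1), each = 35)),
                x1 = c(rnorm(35, 0, 1), rnorm(35, 2, 2)),
                x2 = c(rnorm(35, 0, 1), rnorm(35, 2, 2)))
cv <- qda(Group ~ ., data = d, CV = TRUE)  # leave-one-out, as for lda
mean(cv$class != d$Group)                  # estimated AER
```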

Comments

Estimate of AER = 17/69 ≈ 0.246

\(\hat{\mathbb{P}}[L. torrens|L. carteri]\) = 8/34 ≈ 0.235

\(\hat{\mathbb{P}}[L. carteri|L. torrens]\) = 9/35 ≈ 0.257

Both LDA and QDA have more or less the same misclassification rate. It’s not bad, but can we do any better?

KNN

Normalisation

Before applying KNN, the data should be normalised so that every feature has the same influence on the distance calculation; otherwise features measured on larger scales would dominate.

# Normalisation: rescale each feature to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
normlized.df <- as.data.frame(lapply(flyr[, 2:8], normalize))

Confusion Matrix

Training

Testing

Comment

Cross-Validation

Estimated AER for k=1,2,…,10
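The matrix `knn.cross.labels` used below is not constructed in the excerpt; a sketch of how it might be built with class::knn.cv (synthetic stand-ins for `normlized.df` and `flyr$Group`; storing the 1/2 factor codes matches the `as.integer(...) - 1` recoding below):

```r
library(class)  # knn.cv(): leave-one-out k-nearest-neighbour classification
set.seed(9)
X  <- matrix(runif(69 * 7), nrow = 69)       # stand-in for normlized.df
gr <- factor(rep(c(0, 1), length.out = 69))  # stand-in for flyr$Group
knn.cross.labels <- matrix(0L, nrow = 69, ncol = 10)
for (k in 1:10) {
  # store the predicted labels as factor codes (1/2), one column per k
  knn.cross.labels[, k] <- as.integer(knn.cv(X, cl = gr, k = k))
}
dim(knn.cross.labels)  # 69 leave-one-out predictions for each of k = 1..10
```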

confusion.knn.cross <- list()
for (i in 1:10) {
  confusion.knn.cross[[i]] <- confusionMatrix(
    data = as.factor(as.integer(knn.cross.labels[, i]) - 1),
    reference = as.factor(flyr$Group))
}

knn.aer <- matrix(0, nrow = 10, ncol = 1)
rownames(knn.aer) <- paste0("k=", 1:10)
colnames(knn.aer) <- "Estimate of E[AER]"
for (i in 1:10) knn.aer[i, 1] <- round(1 - confusion.knn.cross[[i]]$overall[1], 3)

# Estimate of AER of KNN for different values of k
knn.aer %>%
  knitr::kable(format = "html")
       Estimate of E[AER]
k=1                 0.116
k=2                 0.116
k=3                 0.130
k=4                 0.116
k=5                 0.087
k=6                 0.130
k=7                 0.087
k=8                 0.145
k=9                 0.145
k=10                0.116

Comment: For k = 5, the estimated AER of KNN is approximately 0.087, which is better than that of LDA.

Confusion Matrix for k=5

draw_confusion_matrix(confusion.knn.cross[[5]])

Comment

Estimate of AER = 6/69 = 0.087

\(\hat{\mathbb{P}}[L. torrens|L. carteri]\) = 6/34 = 0.176

\(\hat{\mathbb{P}}[L. carteri|L. torrens]\) = 0/35 = 0

Logistic Regression

Total Misclassification Rate

ma_model <- glm(Group ~ ., family = binomial(link = "logit"), data = fly)
errvec <- rep(0, nrow(fly))
for (i in 1:nrow(fly)) {
  errvec[i] <- (predict(ma_model, fly[i, -1], type = "response") > 0.5)
}
sum(errvec != fly[, 1]) / nrow(fly)
## [1] 0.08571429

Leave-One-Out Cross Validation

errvec <- rep(0, nrow(fly))

for (i in 1:nrow(fly)) {
  # refit the model without observation i, then predict that observation
  model <- glm(Group ~ ., family = binomial(link = "logit"), data = fly[-i, ])
  errvec[i] <- (predict(model, fly[i, -1], type = "response") > 0.5)
}

sum(errvec != fly[, 1]) / nrow(fly)
## [1] 0.1428571

Conclusions

  1. Our first goal was to verify the results of the original analysis. This goal has borne rich fruit. Our analysis found significant differences between the mean vectors of the features of the two species. Moreover, significant differences were found in wing length, wing width, third palp length, and fourth palp length—which was exactly what the original 1975 analysis had found!

  2. We performed classification using four different techniques: LDA, QDA, KNN, and binary logistic regression. All of these techniques yielded very satisfactory results.

  3. We also performed a principal components analysis and a PC factor analysis on the data. Two principal components alone explain most of the variation in the data. Moreover, the factor analysis after a varimax rotation yields a very meaningful, interpretable result: the latent factors in the data are the “non-antennal measurements” and the “antennal measurements”.

Appendix

We are heavily indebted to the original data analysis by William Atchley, which we enclose.

For further information on the HZ test of hypothesis, see here.